956-962 Nucleic Acids Research, 2012, Vol. 40, No. 3 
doi:10.1093/nar/gkr790 



Published online 27 September 2011 



Sequence-structure relationships in yeast mRNAs 

Andrey Chursov 1 , Mathias C. Walter 2 , Thorsten Schmidt 2 , Andrei Mironov 3 ' 4 , 
Alexander Shneider 5 and Dmitrij Frishman 1,2 * 

department of Genome Oriented Bioinformatics, Technische Universitat Munchen, Wissenschaftzentrum 
Weihenstephan, Maximus-von-lmhof-Forum 3, D-85354, Freising, 2 Helmholtz Center Munich - German 
Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Ingolstadter 
LandstraRe 1, D-85764 Neuherberg, Germany, 3 Department of Bioengineering and Bioinformatics, Moscow 
State University, Leninskie Gory, GSP-1, 119991, institute for Information Transmission Problems, Russian 
Academy of Sciences, Bolshoi Karetny pereulok 19, 127994, Moscow, Russia and 5 Cure Lab, Inc., 43 Rybury 
Hillway, Needham, MA 02492, USA 

Received July 3, 2011; Revised August 31, 2011; Accepted September 8, 2011 



ABSTRACT 

It is generally accepted that functionally important 
RNA structure is more conserved than sequence 
due to compensatory mutations that may alter the 
sequence without disrupting the structure. For small 
RNA molecules sequence-structure relationships 
are relatively well understood. However, structural 
bioinformatics of mRNAs is still in its infancy due 
to a virtual absence of experimental data. This 
report presents the first quantitative assessment 
of sequence-structure divergence in the coding 
regions of mRNA molecules based on recently pub- 
lished transcriptome-wide experimental determin- 
ation of their base paring patterns. Structural 
resemblance in paralogous mRNA pairs quickly 
drops as sequence identity decreases from 100% 
to 85-90%. Structures of mRNAs sharing sequence 
identity below roughly 85% are essentially uncor- 
rected. This outcome is in dramatic contrast to 
small functional non-coding RNAs where sequence 
and structure divergence are correlated at very low 
levels of sequence similarity. The fact that very 
similar mRNA sequences can have vastly different 
secondary structures may imply that the particular 
global shape of base paired elements in coding re- 
gions does not play a major role in modulating gene 
expression and translation efficiency. Apparently, 
the need to maintain stable three-dimensional 
structures of encoded proteins places a much higher 
evolutionary pressure on mRNA sequences than on 
their RNA structures. 



INTRODUCTION 

Secondary structure elements both in the untranslated 
(UTR) and coding (CDS) regions of mRNAs have been 
implicated in a variety of regulatory functions (1). For 
example, riboswitches modulate gene expression through 
conformational changes in response to various stimuli (2). 
In addition, translation initiation, elongation, termination 
and translation efficiency all depend on higher order 
mRNA secondary structures in non-coding regions (3,4). 
Coding region hairpins have also been suggested to play a 
role in the regulation of translation (5). The relationship 
between RNA structure and gene expression in the coding 
regions of mRNAs has been demonstrated both computa- 
tionally and experimentally (6-10). In particular, reduced 
mRNA stability near the start codon has been observed in 
a wide range of species, probably as a mechanism to fa- 
cilitate ribosome binding or start codon recognition by 
initiator tRNA (11). Computational studies show that 
native mRNA sequences have lower folding energies and 
hence more stable structure than codon-randomized ones 
(5). The three mRNA functional domains— 5' -UTR, CDS 
and 3'-UTR — form largely independent folding units, with 
base pairing across domain borders being rare (12). 
Evolutionary conserved local secondary structures have 
been identified in the CDS regions (13,14) and shown to 
be functional (15). 

There is a selective pressure toward maintaining both 
stable RNA structures of coding regions and the 
three-dimensional folds of their encoded proteins (16). It 
has been argued that the redundancy of the genetic code 
plays an important role in satisfying these selection re- 
quirements (12). In general, however, sequence-structure 
relationships in mRNA-coding regions remain elusive; 
and, their spatial structure is unknown. While hundreds 
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of atomic resolution structures have been determined 
for smaller RNA molecules, most notably tRNAs, experi- 
mental structures of large RNAs are still rare (17). Until 
recently, direct experimental determination of mRNA struc- 
ture has been impossible on a large scale. Furthermore, 
most insights into the evolutionary constraints acting on 
them arose from correlating predicted base paring 
patterns with the effects of site-directed mutagenesis on 
mRNA expression and degradation, as well as on the ex- 
pression levels and activity of encoded protein products. 

Significant progress has been made in predicting RNA 
secondary structure from sequence based on free-energy 
minimization (18), probabilistic models (19) and evolu- 
tionary information (20). However, the accuracy of current 
algorithms is still insufficient to model large molecules, 
primarily because the number of theoretically possible 
RNA secondary structures grows exponentially with the 
length of the sequence (21). Also, the free folding energy 
of millions of suboptimal structures is very close to the 
most stable structure. Lowest energy structures may not 
necessarily reflect folding in vivo (22) due to kinetic pro- 
cesses and protein-RNA interactions. Additionally, it is 
hard to model pseudoknots and unstructured regions (23). 

More accurate prediction of RNA secondary structure 
can be achieved by using experimental constraints 
obtained from oligonucleotide data to guide free-energy 
minimization (24). Moreover, experimental methods have 
been developed that allow comprehensive monitoring of 
RNA structure at single nucleotide resolution. One such 
method, fragmentation sequencing, allows for recon- 
structing RNA structures by sequencing fragments of 
single-stranded RNA resulting from nuclease digestion. 
Another method, known as selective 2'-hydroxyl acylation 
and primer extension (SHAPE) (25), exploits the sensitiv- 
ity of selective acetylation of the ribose 2'-hydroxyl 
position to local nucleotide flexibility, thereby allowing 
identification of those nucleotides that are conforma- 
tionally constrained by base pairing. Accurate SHAPE- 
directed RNA structure determination has been reported 
for several types of RNA molecules, including Escherichia 
coli 16S RNA and yeast tRNA asp (26), as well as for the 
entire HIV-1 genome (27). This latter work highlighted the 
intricate relationship between RNA sequences and protein 
structure of the encoded proteins. In particular, it was 
found that flexible loops in protein structures correspond 
to highly structured RNA elements, implying a functional 
role of mRNA structure in the modulation of ribosome 
processivity at domain boundaries. 

In recent work, Kertesz and colleagues (28) reported the 
first transcriptome-wide experimental analysis of mRNA 
structures using the novel technology called parallel 
analysis of RNA structure (PARS). PARS enables the de- 
termination of base pairing probabilities at single nucleo- 
tide resolution by refolding RNAs in vivo, treating them 
with structure-specific enzymes and then sequencing the 
resulting fragments. Structural profiles were obtained for 
more than 3000 transcripts from the budding yeast 
Saccharomyces cerevisiae. The work of Kertesz et al. 
revealed higher degree of structuredness in the mRNA- 
coding regions compared with the 3'- and 5'-untranslated 
regions, implying a functional role of RNA structure in 



coding regions in regulating gene expression. The global 
data set of PARS profiles represents a true treasure trove 
for investigating sequence-structure and structure- 
function relationships in mRNAs. 

This report provides the first comprehensive analysis of 
sequence-structure relationships in the coding regions of 
yeast mRNAs based on base pairing propensities 
measured by the PARS technology. It was found that 
PARS profiles of paralogous mRNAs show very strong, 
essentially linear, correlation sequence for identity levels 
upwards of 85-90%. Yet, pairs of more distantly related 
yeast transcripts secondary structure appear to be unre- 
lated. Interestingly, predicted secondary structures of 
yeast paralogs display a similar behavior with respect to 
sequence identity; and, there is a significant correlation 
between experimental and theoretical structures, as 
noted previously (28). Theoretical structures of ortho- 
logous mRNA pairs from yeast and Candida glabrata 
are also uncorrelated for low sequence identity levels 
while for highly similar sequences no conclusion could 
be made due to lack of data. 



MATERIALS AND METHODS 

Experimental data on yeast mRNA secondary structure 

Secondary structure profiles of 3000 transcripts from the 
budding yeast S. cerevisiae have recently been determined 
using a novel experimental strategy called PARS (28). For 
each individual nucleotide position of mRNAs, a PARS 
score reflects its likelihood to be in a double-stranded con- 
formation. PARS scores for yeast transcripts were down- 
loaded from http://genie.weizmann.ac.il/pubs/PARS10. 
5'- and 3'-UTR regions were identified by sequence com- 
parison with yeast amino acid sequences, and then ex- 
cluded from consideration. In the following, a vector of 
PARS scores for a given transcript is referred to as its 
experimental structure. 

Yeast paralogs 

Data on paralogous yeast proteins were kindly 
provided by Martin Miinsterkotter and Ulrich Giildner 
from the fungal genomics group at the Institute for 
Bioinformatics and Systems Biology (German Research 
Center for Environmental Health, Munich). A list of 
protein pairs sharing significant similarity (identity at the 
amino acid level >50%) was extracted from the SIMAP 
database (29). Additionally, the putative paralogs were 
required to have not >10% difference in sequence 
length. In total, 243 paralog pairs involving 409 different 
yeast genes satisfied these conditions. 

Amino acid sequences of paralogous yeast proteins were 
globally aligned using the ggsearch program from the 
FASTA software suite (30). Amino acid sequence align- 
ments were subsequently converted into mRNA sequence 
alignments; and, the percent identity between each pair of 
coding regions was calculated by dividing the number of 
identical nucleotides by the length of the alignment. 
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Orthologs from C. glabrata 

Sequence data for C. glabrata were downloaded from the 
PEDANT genome database (31). A list of orthologous 
protein pairs between 5. cerevisiae and C. glabrata was 
extracted from the eggNOG database (32). In total, we 
obtained 2327 ortholog pairs. The alignment procedure 
was the same as for paralogs, see above. 

PARS score distances between yeast paralogs 

To assess global structural similarity between pairs of 
aligned mRNA sequences, root mean square deviations 
(RMSDs) between vectors of PARS scores were calculated 
for all alignment positions that did not contain gaps. 
Additionally, for each transcript pair, profiles of local struc- 
tural similarity were obtained by calculating RMSDs 
between PARS scores in non-gapped alignment positions 
within a sliding window of varying length, typically 
between 100 and 1000 nt. 

Prediction of mRNA secondary structures 

For each nucleotide position of transcript sequences, the 
theoretical probability to be in double-stranded conform- 
ation was calculated using the RNAfold method from the 
Vienna RNA package (33). As done similarly for experi- 
mental PARS scores (see above), RNAfold probability 
values were used to calculate global and local measures 



of structural similarity between aligned coding regions of 
mRNAs based on RMSD. For brevity, a vector of pre- 
dicted probabilities of RNA bases in double-stranded con- 
formation for a given transcript is further referred to as its 
theoretical structure. 

Data availability 

All sequence alignments together with experimentally 
determined and predicted structures are available in 
Supplementary Data. 

RESULTS 

By illustrating the data used in this study on a concrete 
example, the research results can be readily presented. 
Two yeast mRNA sequences, YBR092C and YBR093C, 
share 86.5% sequence identity, and their partial align- 
ment is depicted in the top part of Figure 1. The 
position-dependent PARS scores for both sequences are 
shown in the middle part of Figure 1 . Both graphs display 
a rather high degree or correlation, albeit not perfect. In 
the bottom part of Figure 1 , theoretical structures (prob- 
abilities for individual bases to be paired) are drawn along 
the sequence. Figure 2 shows how distances between ex- 
perimental and theoretical structures of YBR092C and 
YBR093C vary along the mRNA sequence dependent 
on sequence identity in a local sequence window. As 



YBR092C atgtttaaGtctgttgtttattcGGttCtagccgctGctttAgTTaatgc 
YBR093C atgtttaaAtctgttgtttattcAAttTtagccgctTctttGgCCaatgc 



o o 



03 

5 0 

D_ 



-2 



1.0 



|0,6 

CO 

.o 

o 

o: 0.4 



0.2 



0.0 



10 20 30 40 



50 



Position 



actgggactggaaTactactcaCtacaacgATaCCctattAaAacaataA 
actgggactggaaCactactcaTtacaacgCCaGTctattGaGacaataG 



-YBR092C 
•YBR093C 



1360 1370 1380 1390 1400 



Position 



Figure 1. Sequence alignment, experimental and theoretical structures of the first and last 50 nt for the pair of yeast mRNA sequences YBR092C 
(dashed lines) and YBR093C (dotted lines). 
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expected, highly similar regions generally correspond to 
more similar structures. 

Calculations exemplified in Figures 1 and 2 were per- 
formed for all pairs of paralogous mRNA sequences in 
our data set. Table 1 summarizes pair-wise correlations 
between the three evolutionary measures considered in 
this work for different ranges of sequence identities. 
Figure 3 a shows how the difference between experimental 
structures depends on sequence similarity. PARS scores 
appear to be entirely uncorrelated for identity levels of 
up to ~85-90%. In this sequence identity range, the me- 
dian RMSD between PARS score vectors does not differ 
from the median calculated for randomly selected mRNA 
pairs (dashed horizontal line in Figure 3a). For sequence 
identity levels over 85-90%, the distance between experi- 
mental structures shows essentially a linear dependence 
from sequence similarity (Supplementary Figure SI). 

Upon conducting the same experiment with pairs of 
theoretical structures of yeast mRNAs, it was found that 
the distance between the structures also begins to depend 
on sequence similarity upward of roughly 85-90% identity 
(Figure 3b). For pairs with identity between sequences 
within the range from 97.5% to 100%, the median 
distance between theoretical structures constitutes 38% 
of the random level. Yet, for experimental structures, it 
is lower at 29%. The link between sequence and structure 
is thus stronger when experimental structures are con- 
sidered. The distance between theoretical structures also 
shows a linear dependence from sequence similarity for 
sequence identity levels over 85-90% (Supplementary 
Figure S2). 



Therefore, what is the significance of the sequence- 
structure dependence shown in Figure 3; and, how 
would it appear for codon-randomized mRNA sequences? 
Since experimental PARS scores are not available for 
randomly generated sequences, this issue could only be 
assessed for theoretical structures. For each pair of 
paralogs, one sequence was kept unchanged. In the 
second mRNA, however, mutations were randomly 
distributed along the sequence, keeping the encoded 
amino acid sequence, the codon usage and the total 
number of mutations between the paralogs unchanged. 
Overall, the divergence of structures between codon- 
randomized paralogs displays virtually the same depend- 
ence on sequence similarity as for native sequences 
(Supplementary Figure S3). 

We also compared predicted structures between 
orthologous mRNAs from S. cerevisiae and the pathogen- 
ic yeast C. glabrata (Figure 4). Although C. glabrata is the 
most closely related organism to S. cerevisiae with a com- 
pletely sequenced genome (34), no pair of orthologous 
mRNAs between these two organisms shares sequence 
identity >95% and thus no conclusion about structure 
divergence for very similar sequences could be made. 
However, for lower identity levels theoretical structures 
of orthologs are uncorrelated and thus behave the same 
way as paralogous structures. 

DISCUSSION 

In some sense, the current situation in RNA bioinformat- 
ics is reminiscent of the early days of structural 
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Figure 2. The profile of local structural similarity versus local sequence identity for the pair of yeast mRNA sequences YBR092C and YBR093C. 
The length of the sliding window is 300. The global sequence identity between these two sequences is 86.5%. 
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Table 1. Correlation coefficients and P-values for different ranges of sequence identity 
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Figure 3. Boxplots of distances between structures of aligned paralogous mRNAs in different ranges of sequence similarity. Each box corresponds to 
the range of similarity 2.5%. The box extends from the lower to the upper quartile values, with a horizontal line at the median value. Whiskers 
demonstrate the entire range of the data. Crosses show outliers, (a) Distances between experimental structures. The average level of PARS score 
distances for alignments of random sequence pairs is 2.14 (dashed line), (b) Distances between theoretical structures. The average level of probability 
distance for alignments of random sequence pairs is 0.5 (dashed line). 



bioinformatics of proteins, when the availability of a suf- 
ficiently large data set of X-ray structures allowed for the 
first comprehensive analysis of the relation between the 
divergence of sequence and structure in proteins (35). 
Until recently, studies of the evolutionary conservation 
of RNA structures were based on in silico predictions 
and largely limited to non-coding RNA. In the first 
large-scale study, Schudoma et al. (36) determined that 
in short RNA loops with known three-dimensional struc- 
tures sequence identity >75% implies significant struc- 
tural similarity. The most comprehensive investigation of 
sequence-structure relationships in RNA molecules to 
date is based on all-against-all pair-wise structural com- 
parison of non-coding RNAs (tRNAs, rRNAs, 
riboswitches and riboswitches) with known spatial archi- 
tectures (37). Assessment of evolutionary divergence 
revealed that the correlation between sequence and sec- 
ondary structure conservation is highly significant for 
sequence identity levels in the range between just a few 
percentage points up to roughly 60% where this relation- 
ship saturates. Further increase of sequence similarity (60- 
100%) does not lead to an appreciable growth of second- 
ary structure similarity. None of the studies mentioned 



above considered mRNAs because no mRNA structures 
are currently known at atomic resolution. 

The principal finding of this research is that the correl- 
ation between sequence and structure in the coding regions 
of yeast mRNAs is much weaker than in small non-coding 
RNAs. Up to ~85-90% sequence identity, the similarity 
of both experimental and theoretical base pairing pro- 
pensities between paralogous yeast mRNAs is at random 
level; while, for more similar sequence pairs, sequence and 
structure are strongly correlated. This may imply that 
mRNAs do not experience a strong selective pressure to 
preserve a certain degree of structuredness. The fact that 
codon-randomized sequences display a similar behavior 
also indicates that there is no appreciable evolutionary 
pressure to preserve a particular RNA structure as long 
as the encoded protein remains unchanged. Taken 
together, these results underscore a high degree of evolu- 
tionary neutrality in yeast mRNA molecules, both at the 
level of primary (third codon position) and secondary 
(extent of base paring) structure. 

On one hand, our findings are in strong contrast to 
many non-coding RNAs and c«-acting regulatory 
elements of mRNAs whose biological function is primarily 
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Figure 4. Boxplot of distances between theoretical structures of aligned 
orthologous mRNAs in different ranges of sequence similarity. 
Notation as in Figure 3. 



to conduct comparative analyses of mRNA structuromes 
[the term coined by Westhof and Romby (44)], focusing 
on orthologous sequences from multiple organisms and 
taking into account important genomic variables, such 
as expression level and evolutionary rate. Given the cur- 
rent pace of high-throughput RNA analysis technologies 
there is no doubt that such data will become available in 
the near future. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary figures S1-S3. 
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mediated by their spatial architecture (38) stabilized by 
tertiary interactions, modified bases and interactions 
with proteins and small ligands. On the other hand, 
sequence-structure relationships observed in this work 
are compatible with the notion that, in general, RNA mol- 
ecules do not have a single global structure. Instead, they 
exist as a highly dynamic ensemble of alternative conform- 
ations (39,40) that are often capable of performing differ- 
ent functions (41). The extent of base pairing may play a 
role in the regulation of pre-mRNA splicing, translation 
and mRNA degradation. Both experimentally determined 
PARS scores and computationally derived partition func- 
tions analyzed in this work are statistical measures that 
reflect the propensity of each nucleotide to form a base 
pair across a large number of metastable structures. 

This analysis has several important limitations. First, 
PARS probes RNA structures in vitro rather than in the 
living cell and may not always reproduce functional RNA 
structures (42). Second, even if the base paring informa- 
tion obtained by the PARS technology were perfectly cor- 
rect, it still merely represents a one-dimensional profile of 
structural propensities, a far cry from knowing the actual 
RNA secondary structure, let alone spatial architecture, 
for each individual molecule at any moment of time. 
Third, the findings do not rule out much stronger 
sequence-structure correlations in certain local structural 
elements of coding regions, such as reprogrammed 
genetic-decoding signals (43) or mRNA localization 
signals. We also cannot rule out the possibility that the 
degree of mRNA structuredness does have an important 
functional role in spite of quick erosion of structural simi- 
larity between paralogs with diminishing sequence similar- 
ity, and that this erosion reflects functional differentiation. 
However, we consider such explanation unlikely because 
the same behavior is observed between orthologous 
mRNAs. Finally, only a small subset of the PARS data 
constituted by pairs of sequence similar yeast mRNAs 
(paralogs) was explored. As a next step, it will be exciting 
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